Adjective Density as a Text Formality Characteristic for Automatic Text Classification: A Study Based on the British National Corpus

Authors

  • Alex Chengyu Fang
  • Jing Cao
Abstract

In this article, we report significant findings resulting from an investigation into the correlation between adjective density, calculated as the proportion of adjectives in word tokens, and degrees of text formality as part of an attempt to examine the potential application of adjectives in automatic text classification and identification. Correlations obtained from the training corpus will be compared with human ranking of the text categories concerned in the study and then adapted to unseen data in the test set. A linear regression analysis suggests a strong correlation between degrees of text formality and adjective density. With a weighted average F-measure of 0.606 achieved by a Naïve Bayes classifier, the research establishes adjectives as a powerful differentia of text categories amongst the open word classes, an important feature that has been generally ignored by past studies in automatic text categorization. The empirical findings suggest that the use of adjective density will lead to enhanced practical systems for automatic text classification.
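The abstract defines adjective density as the proportion of adjectives in word tokens. A minimal sketch of that ratio follows, assuming the input is already part-of-speech tagged with BNC-style C5 tags. The tag sets, the exclusion of punctuation from the token count, and the function name `adjective_density` are illustrative assumptions, not details taken from the article.

```python
# Assumed C5 adjective tags: general, comparative, superlative.
ADJECTIVE_TAGS = {"AJ0", "AJC", "AJS"}
# Assumed C5 punctuation tags, excluded from the word-token count.
PUNCTUATION_TAGS = {"PUL", "PUN", "PUQ", "PUR"}


def adjective_density(tagged_tokens):
    """Return the share of word tokens tagged as adjectives.

    tagged_tokens is a list of (word, tag) pairs. Punctuation is
    dropped before counting, which is an assumption of this sketch.
    """
    word_tags = [tag for _, tag in tagged_tokens if tag not in PUNCTUATION_TAGS]
    if not word_tags:
        return 0.0
    adjective_count = sum(1 for tag in word_tags if tag in ADJECTIVE_TAGS)
    return adjective_count / len(word_tags)


# Hypothetical tagged sentence: 2 adjectives among 5 word tokens.
sample = [
    ("The", "AT0"),
    ("empirical", "AJ0"),
    ("findings", "NN2"),
    ("are", "VBB"),
    ("significant", "AJ0"),
    (".", "PUN"),
]
print(adjective_density(sample))
```

Computed per text, a value of this kind could serve as a single numeric feature for a classifier or a regression against formality rankings. How the study itself handles tokenization and ambiguous tags is not stated in the abstract.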

Similar resources

A Corpus-based Study of Latinate Words in Contemporary English

The English language has borrowed extensively from Latin. Despite their widely acknowledged importance, however, Latinate words are under-studied especially quantitatively according to their use across different linguistic settings. In this paper, we report a corpus-based survey of Latinate words and their use in contemporary English. The objective is to chart the use and distribution of Latina...

Situation and Text: Representation of Migrants Whilst the Escalation of Refugee Crisis in Great Britain as Compared to Russia

Increasing migration is a vital concern for a globalizing sociocultural environment in today’s world. The UK and developed European countries have become an attractive destination for asylum seekers (labelled as “migrants”) in the past decade. The rapid rise in the number of asylum seekers, which was labelled “migration crisis” (Ruz, 2015), made this topic an integral part of scientific discuss...

Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents

Besides its own merits, text classification (TC) has become a cornerstone in many applications. Work presented here is part of and a pre-requisite for a project we have undertaken to create a corpus for the Arabic text process. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...

Corpus based coreference resolution for Farsi text

"Coreference resolution", or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreferent when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...

Statistically-constrained shallow text marking: techniques, evaluation paradigm and results

We present three natural language marking strategies based on fast and reliable shallow parsing techniques, and on widely available lexical resources: lexical substitution, adjective conjunction swaps, and relativiser switching. We test these techniques on a random sample of the British National Corpus. Individual candidate marks are checked for goodness of structural and semantic fit, using bo...

Publication date: 2009